Main features of interest in your dataset?
Based on the above analysis the PropserScore, BorrowedAPR and the Loan Amount seem to be features of interest
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Prosper Loan Dataset: The original Prosper Loan Dataset contains information of loans cetgorized across 81 variables. For the purpose of this analysis we are subsetting this dataset to analyse the data pertaining to 16 variables namely ‘CreditGrade’,‘Term’,‘LoanStatus’,‘BorrowerAPR’, ‘BorrowerRate’,‘ListingCategory..numeric.’,‘BorrowerState’,‘Occupation’, ‘EmploymentStatus’,‘EmploymentStatusDuration’,‘IsBorrowerHomeowner’, ‘IncomeRange’,‘LoanNumber’,‘LoanOriginalAmount’,‘LoanOriginationDate’ and ‘LoanOriginationQuarter’.
## 'data.frame': 113937 obs. of 16 variables:
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
Understanding the range of values for each column:
CreditGrade:
## [1] "" "A" "AA" "B" "C" "D" "E" "HR" "NC"
Term in Months:
## [1] 12 60
Term in Months Distribution:
##
## 12 36 60
## 1614 87778 24545
*** Converting Term to factors ***
Loan Status:
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (1-15 days)" "Past Due (16-30 days)"
## [9] "Past Due (31-60 days)" "Past Due (61-90 days)"
## [11] "Past Due (91-120 days)" "Past Due (>120 days)"
Loan Status Distribution:
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
Borrower APR:
## [1] 0.00653 0.51229
Loan Original Amount:
## [1] 1000 35000
Employment status duration in months:
## [1] 0 755
Prosper score for Loans:
## [1] 1 11
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
** Term distribution of loans **
## 12 36 60
## 1614 87778 24545
It is observed that all the loans are either of 12,36 or 60 months duration. The maximum number of loans have a term of 36 months followed by 60 month term loans.
## A AA B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
It is observed that max loans have no credit grade assigned. Among the rest ‘C’ Credit Grade loans are marginally higher than the rest grades.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
It is observed that approximately 5000 loans have defaulted status and maximum loans in the dataset are current. The pst due loans are comparitveley very less.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
It is observed that maximum loans have been issued at an APR of 36%.
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
It is observed that maximum loans have been issued to Employed people and very few loans have been issued to not employed and part time people.
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
The histogram is a perfect bell curve and shows that max loans have been given to the middle income groups with $25K-49K and $50K-74.99K and the loans fall off on either side as the income rises and falls.
## $title
## [1] "LoanOriginalAmount Histogram"
##
## $subtitle
## NULL
##
## attr(,"class")
## [1] "labels"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The most frequently sanctioned Loan amounts seems to be 4K$, 10K and 15K dollars. There are also loans sanctioned for 35K dollars.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.774 3.000 20.000
Maximum loans have been sanctioned under listing catgeory 1(Debt Consolidation) followd by 0(Not Available) and 7(Other). Very few loans have been sanctioned to 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans followed by 18 - Taxes, 19 - Vacation, 20 - Wedding Loans.
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
Max lons have been sanctioned to people from state of California followed by approximatley equal number of loans to people from Florida, Illinois, New York and Texas.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
It is observed that max loans have been given to people who have only spent a month or two in their employment status and there is a clear downward march of loans given out decreasing as the employment status duration increases.
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456 29084
It is observed that max loans have a prosper score of 4,6 and 8 while there are comparitevely very few loans having the worst score of 1 and best score of 11.
Based on the above analysis the PropserScore, BorrowedAPR and the Loan Amount seem to be features of interest
IsBorrowerHomeowner, Term and EmploymentStatusDuration are the features which could support the investigation.
The employment status duration histogram was unexpected as it showed that the loans granted to people decreased as their duration in the respective employment increased. It perhaps could be attributed to the fact that max loans were given out to new employees upon getting a job within the first two months.
We see that maximum loans have been given to non home owners as compared to home owners across all the prosper score ratings.
##
## 12 36 60
## 1614 87778 24545
There are only three values of term ie 12,36 and 60 months.We can see that the median BorrowerAPR is approximately the same for all three terms. But the range of BorrowerAPR values is much larger for 36 Months Term Loans than others.
As the Loan amount increases the Borrower APR moves to a range between 0.1 to 0.2 from 0.05 to 0.4. There seems to be a negative corelation between Borrower APR and Loan Amount.
We want to improve the above plot by coverting the employment status duration from days to months.
We now see that Employment status duration month wise has a similar range of APR for all months. However we also observe that many loans with varying values of employment status duration have an APR whci is 0.35-0.36 percentage points.
Here also we shall try to change the value of EmploymentStatus duration from days to months to revisit this graph
We now see that Employment status duration for varying montha has similar Loan Amount distribution. But we also notice that maximum loans in each value of Employment status duration have a Loan Amounts of 10K\(,15K\), 20K$ and 25K$.
We see that median Loan Amount increases with increase in the term distribution.
We see that as the PropserScore increases the Median LoanOriginalAmount also tends to increase.
The median Employment status duration does not seem to change much with the propser score.
We can see that the median Borrower APR decreases with increasing Prosper Score..Hence we would want to further analyze this relationship as under. We shall group the data by prosper score and try to find the Borrower APR mean for each group and then plot the mean vs the Prosper Score. To do this we shall use ‘dplyr’ package and derive a dataframe by grouping the existing dataframe by ProsperScore.
As expected we find a very close inverse linear relationship between ProsperScore and the mean Borrower APR. We shall also try and find a relationship between the ProsperScore and the Loan Amount sanctioned.
We see that there is an increasing linear relationship between the mean Loan Amount and the Prosper Score.
By grouping the loans using the ProsperScore and then finding the BorrowerAPR mean of each group we were able to plot and see the relationship between the ProsperSCore and BorrowerAPR mean as well as ProsperScore and LoanOriginalAmount. We saw that highly rated loans had more Loan Amount Sanctioned and lesser Borrower APR.
We see that across the prosper score ratings People who are not Home owners(red color) tend to have higher Borrower APR. We also see that as the propser score increases (gets better) the Borrower APR goes down.
We wanted to further understand the relationship between Loan Original Amount and Borrower APR. By colouring the data using the HomeOwner variable we see that maximum loans having APR from 0.05 to 0.4 have been given to the HomeOwners. Loans with Loan Amounts more than 0.4 are mostly given to People without home ownership.
We wanted to extend the analysis of ProsperScore and the LoanOriginalAmount using the home owner variable. From the graph above we see that People who are not home owners form the majority of loans that have lesser loan original amounts as compared to loans of home owners.Loans upward of 10K$ are mostly concentrated with Home owners. It is also seen that Loans with prosper score of 9 and above are predominantly of Home owners.
We have used the IsBorrowerHomeowner variable to extend our analysis between the ProsperScore , BorrowerAPR against the LoanOriginalAmount. The aim was to see if the variable IsBorrowerHomeOwner would give us any indication about the interplay of these three variables.From the above two graphs we can conclude that Loans given to Homeowners have better APR, Larger LoanAmounts and better prosper scores as compared to non home owners.
The mean Loan Amount of loans grouped by ProsperScore has a clear approximate increasing linear relationship with the ProsperScore. It can be understood logically that highly rated loans have a larger Loan Amount than loans with poor rating.
As an inverse of the previous graph we can see here that the APR mean falls with increasing Prosper Score. Again this can be understood in a way that the loans with higher Prosper SCore have a mean APR lower than the loans with poorer Prosper Scores.
We see that across the prosper score ratings People who are not Home owners(red color) tend to have higher Borrower APR. We also see that as the propser score increases (gets better) the Borrower APR goes down.
It was a difficult process to analyze this dataset. The intial difficulty lay in trying to guess the interesting variables for analysis from a total of 81 variables. The distributions of the variables helped understand the nature of the data and the distribution which in turn help guess the pair of variables that could seem to have some relationship between them. After narrowing down to the interesting relationship between Borrowed APR and Loan Amount it was decided to explore this relationship further using the ProsperScore variable. Finally we succeeded in establishing that both the Loan Amount and Borrowed APR varied in a linear fashion compared to the ProsperScore variable.
Further work can include fitting a model to Final Plots One and Two..